-
Notifications
You must be signed in to change notification settings - Fork 158
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Export ActionSet Metrics for Kanister #1603
Conversation
Thanks for submitting this pull request 🎉. The team will review it soon and get back to you. If you haven't already, please take a moment to review our project contributing guideline and Code of Conduct document. |
|
||
// Event describes an individual event. | ||
// | ||
// Note: The type and labels are private in order to force the use of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to see the portion of comment block here at top of this file to explain the relationship between MetricType and constructors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
pkg/controller/controller.go
Outdated
@@ -209,6 +213,15 @@ func (c *Controller) onAddActionSet(as *crv1alpha1.ActionSet) error { | |||
if err := validate.ActionSet(as); err != nil { | |||
return err | |||
} | |||
// ActionSet created |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code might be better placed within c.handleActionSet (after determining status is pending, before updating status to be running).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few thoughts, not a complete review
pkg/metrics/metrics.go
Outdated
} | ||
|
||
// Initialize a Prometheus GaugeVec for one metric and register it | ||
// nolint:all // Function is for expanding on metrics to introduce Gauges |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Disabling all linters for this entire function seems rather dangerous, doesn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aside: is this function being used anywhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it is dangerous. The function is not being used anywhere and I feel it would be safe to remove it, Can be re-written easily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
pkg/metrics/metrics.go
Outdated
if counterVecs[e.eventType].With(e.labels) == nil { | ||
return fmt.Errorf("%s Labels for %s Event Type not found", e.labels, e.eventType) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think (*CounterVec).With
can fail in this way, can it?
At very least, we should probably capture the returned counter and increment it below (rather than calling With
a second time to have it recreate the counter).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
"github.com/prometheus/client_golang/prometheus" | ||
) | ||
|
||
var counterVecs = make(map[MetricType]*prometheus.CounterVec) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than a global, it would be more idiomatic Go to have a NewCounterVecMap
function that returns a (probably opaque) *struct that contains the allocation.
That pointer would be stored in Controller. And then InitAllCounterVecs and IncrementCounterVec would be methods on *CounterVecMap.
Otherwise we have a singleton global that can be harder to test and assumes that there is only ever 1 such map in an image.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks good. I would like to see the change from global to dynamically allocated struct before approving.
If @pavannd1 wants to approve as is I won't be upset.
pkg/metrics/events.go
Outdated
var MetricTypeOpts = map[MetricType]MetricTypeOpt{ | ||
SampleCountType: { | ||
EventFunc: NewSampleCount, | ||
Help: "Sample counter to remove later", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed. Done.
pkg/metrics/events.go
Outdated
ActionSetBackupCreatedType MetricType = "actionset_backup_created_count" | ||
ActionSetBackupCompletedType MetricType = "actionset_backup_completed_count" | ||
|
||
ActionSetRestoreCreatedType MetricType = "actionset_restore_created_count" | ||
ActionSetRestoreCompletedType MetricType = "actionset_restore_completed_count" | ||
|
||
ActionSetTotalCreatedType MetricType = "actionset_total_created_count" | ||
ActionSetTotalCompletedType MetricType = "actionset_total_completed_count" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The metrics name should be prefixed with kanister_
per Prometheus doc.
Instead of the _created_count
suffix, we should use _created_total
. _count
is used for histograms and summaries types. See here.
It's also better to instrument for failures_total
than completed_count
, to make calculation of failure rates easier. See slide 9 & 10 of this slides and the Prometheus doc.
I don't think we need separate backup and restore metrics. They can be represented as labels (with function names as values) in the kanister_actionset_completed_total
and kanister_actionset_failures_total
metrics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small note: …_failures_total
doesn't stand on its own if the intention is to calculate failure rate. A separate completed count would need to be maintained, wouldn't it?
It appears that the way the original …_completed_count
was implemented would still accomplish the same goal since it includes a state
that indicates whether it failed or not. This should allow the rate to be calculated as rate(…_completed_count{state="failure"}[5m]) / rate(…_completed_count[5m])
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume _created_count
is what will be used with _failures_total
to calculate failure rates. The state
label works too, but then why have a separate created
metric? I interpret completed as succeeded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'm not sure if/what value the separate created
metric provides.
But it did seem weird to me to calculate a rate based upon created vs. completed + failed (especially since there's a strange temporal dissonance there…). It seemed more logical to me to compare the number of completed vs. completed + failed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is completed + failed the same as "created", which is the same as "total"? In my mind, they are. As a start, I want to focus on the error rates i.e., rate(actionset_failures_total[5m]) / rate(actionset_total[5m])
. It's more commonly used than non-error rates from a system alert perspective. The rate(actionset_total)
is also great for detecting saturation and spikes.
At the code level, ideally, there are two places that call Inc()
; one for requests_total
when the actionset is initiated, and one for failures_total
inside the caller's if err != nil
.
How would the state
label work on a counter metric, as the actionset goes through different phases throughout its lifecycle? Will there be one data point for each state?
}, | ||
} | ||
|
||
// Event describes an individual event. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure we are gaining much with this new Event
type. If I understand the intent here, it feels like a sample is closer to what you are trying to describe. It represents a specific instance value of the metric, at a given time. To me, events represent action, incident and occurrence recognized by the system such as those defined in this K8s list.
What do you think of merging Event
, MetricType
and MetricTypeOpt
into a Metric
and Sample
structs? E.g.,
type Metric struct {
Name string
Help string
LabelNames []string
}
type Sample struct {
Name string
Labels prometheus.Labels
}
Here is a diff with suggestions on what the code can look like with the two types:
diff --git a/pkg/controller/controller.go b/pkg/controller/controller.go
index ff5a0a10..94e65aad 100644
--- a/pkg/controller/controller.go
+++ b/pkg/controller/controller.go
@@ -62,7 +62,7 @@ type Controller struct {
config *rest.Config
crClient versioned.Interface
clientset kubernetes.Interface
- counterVecs map[metrics.MetricType]*prometheus.CounterVec
+ counterVecs map[string]*prometheus.CounterVec
dynClient dynamic.Interface
osClient osversioned.Interface
recorder record.EventRecorder
@@ -374,7 +374,8 @@ func (c *Controller) handleActionSet(as *crv1alpha1.ActionSet) (err error) {
// These labels will be passed to a function in the metrics package
// Also, a list of events to update metrics should be created and incremented.
for _, a := range as.Spec.Actions {
- if err := metrics.IncrementCounterVec(metrics.NewActionSetTotalCreated(a.Name, as.GetNamespace())); err != nil {
+ s := metrics.NewSampleActionSetTotal(a.Name, as.GetNamespace())
+ if err := metrics.IncrementCounterVec(s); err != nil {
log.Error().WithError(err).Print("Metrics Incrementation failed")
}
}
diff --git a/pkg/metrics/events.go b/pkg/metrics/events.go
index 81b341d1..65f026c6 100644
--- a/pkg/metrics/events.go
+++ b/pkg/metrics/events.go
@@ -1,5 +1,7 @@
package metrics
+import "github.com/prometheus/client_golang/prometheus"
+
// This file contains wrapper functions that will map a Prometheus metric names to its
// label field, help field and an associated Event.
@@ -8,18 +10,37 @@ package metrics
type MetricType string
const (
- SampleCountType MetricType = "sample_count"
-
- ActionSetBackupCreatedType MetricType = "actionset_backup_created_count"
- ActionSetBackupCompletedType MetricType = "actionset_backup_completed_count"
+ ActionSetBackupCreatedType = "actionset_backup_created_count"
+ ActionSetBackupCompletedType = "actionset_backup_completed_count"
- ActionSetRestoreCreatedType MetricType = "actionset_restore_created_count"
- ActionSetRestoreCompletedType MetricType = "actionset_restore_completed_count"
+ ActionSetRestoreCreatedType = "actionset_restore_created_count"
+ ActionSetRestoreCompletedType = "actionset_restore_completed_count"
- ActionSetTotalCreatedType MetricType = "actionset_total_created_count"
- ActionSetTotalCompletedType MetricType = "actionset_total_completed_count"
+ ActionSetTotalCreatedType = "actionset_total_created_count"
+ ActionSetTotalCompletedType = "actionset_total_completed_count"
)
+type Metric struct {
+ Name string
+ Help string
+ LabelNames []string
+}
+
+type Sample struct {
+ Name string
+ Labels prometheus.Labels
+}
+
+func NewSampleActionSetTotal(action string, namespace string) *Sample {
+ return &Sample{
+ Name: ActionSetTotalCompletedType,
+ Labels: prometheus.Labels{
+ "actionType": action,
+ "namespace": namespace,
+ },
+ }
+}
+
// MetricTypeOpt is a struct for a Prometheus metric.
// Help and LabelNames are passed directly to the Prometheus predefined functions.
// EventFunc holds the constructor of the linked Event of a given MetricType.
@@ -32,40 +53,32 @@ type MetricTypeOpt struct {
// Mapping a Prometheus MetricType to the metric MetricTypeOpt struct.
// Basically, a metric name is mapped to its associated Help and LabelName fields.
// The linked event function (EventFunc) is also mapped to this metric name as a part of MetricTypeOpt.
-var MetricTypeOpts = map[MetricType]MetricTypeOpt{
- SampleCountType: {
- EventFunc: NewSampleCount,
- Help: "Sample counter to remove later",
- LabelNames: []string{"sample"},
- },
-
+var Counters = map[string]*Metric{
ActionSetBackupCreatedType: {
- EventFunc: NewActionSetBackupCreated,
- Help: "The count of backup ActionSets created",
+ Name: ActionSetBackupCreatedType,
+ Help: "The count of backup ActionSets created",
},
ActionSetBackupCompletedType: {
- EventFunc: NewActionSetBackupCompleted,
+ Name: ActionSetBackupCompletedType,
Help: "The count of backup ActionSets completed",
LabelNames: []string{"state"},
},
-
ActionSetRestoreCreatedType: {
- EventFunc: NewActionSetRestoreCreated,
- Help: "The count of restore ActionSets created",
+ Name: ActionSetRestoreCreatedType,
+ Help: "The count of restore ActionSets created",
},
ActionSetRestoreCompletedType: {
- EventFunc: NewActionSetRestoreCompleted,
+ Name: ActionSetRestoreCompletedType,
Help: "The count of restore ActionSets completed",
LabelNames: []string{"state"},
},
-
ActionSetTotalCreatedType: {
- EventFunc: NewActionSetTotalCreated,
+ Name: ActionSetTotalCreatedType,
Help: "The count of total ActionSets created",
LabelNames: []string{"actionType", "namespace"},
},
ActionSetTotalCompletedType: {
- EventFunc: NewActionSetTotalCompleted,
+ Name: ActionSetTotalCompletedType,
Help: "The count of total ActionSets completed",
LabelNames: []string{"actionType", "namespace", "state"},
},
@@ -98,15 +111,6 @@ func (e *Event) Labels() map[string]string {
return labels
}
-func NewSampleCount(sample string) Event {
- return Event{
- eventType: SampleCountType,
- labels: map[<details>
<summary>Click me</summary>string]string{
- "sample": sample,
- },
- }
-}
-
func NewActionSetBackupCreated() Event {
return Event{
eventType: ActionSetBackupCreatedType,
diff --git a/pkg/metrics/metrics.go b/pkg/metrics/metrics.go
index 17a0bb55..f76c60f9 100644
--- a/pkg/metrics/metrics.go
+++ b/pkg/metrics/metrics.go
@@ -8,25 +8,19 @@ import (
"github.com/prometheus/client_golang/prometheus"
)
-var counterVecs = make(map[MetricType]*prometheus.CounterVec)
+var counterVecs = map[string]*prometheus.CounterVec{}
// Initialize a Prometheus CounterVec for one metric and register it
-func initCounterVec(r prometheus.Registerer, t MetricType) (*prometheus.CounterVec, error) {
- metricTypeOpts, ok := MetricTypeOpts[t]
-
- if !ok {
- panic(fmt.Sprintf("Event type %s is not defined", t))
- }
-
+func initCounterVec(r prometheus.Registerer, m *Metric) (*prometheus.CounterVec, error) {
opts := prometheus.CounterOpts{
- Name: string(t),
- Help: metricTypeOpts.Help,
+ Name: string(m.Name),
+ Help: m.Help,
}
- counterVec := prometheus.NewCounterVec(opts, metricTypeOpts.LabelNames)
+ counterVec := prometheus.NewCounterVec(opts, m.LabelNames)
err := r.Register(counterVec)
if err != nil {
- return nil, fmt.Errorf("%s not registered: %s ", t, err)
+ return nil, fmt.Errorf("%s not registered: %s ", m.Name, err)
}
var alreadyRegisteredErr prometheus.AlreadyRegisteredError
if errors.As(err, &alreadyRegisteredErr) {
@@ -39,26 +33,26 @@ func initCounterVec(r prometheus.Registerer, t MetricType) (*prometheus.CounterV
}
// Initialize all the Counter Vecs and save it in a map
-func InitAllCounterVecs(r prometheus.Registerer) map[MetricType]*prometheus.CounterVec {
- for metricType := range MetricTypeOpts {
- cv, err := initCounterVec(r, metricType)
+func InitAllCounterVecs(r prometheus.Registerer) map[string]*prometheus.CounterVec {
+ for _, counter := range Counters {
+ cv, err := initCounterVec(r, counter)
if err != nil {
log.WithError(err).Print("Failed to register metric %s")
return nil
}
- counterVecs[metricType] = cv
+ counterVecs[counter.Name] = cv
}
return counterVecs
}
// Increment a Counter Vec metric
-func IncrementCounterVec(e Event) error {
- if counterVecs[e.eventType] == nil {
- return fmt.Errorf("%s Event Type not found", e.eventType)
+func IncrementCounterVec(s *Sample) error {
+ if counterVecs[s.Name] == nil {
+ return fmt.Errorf("%s Event Type not found", s)
}
- cv := counterVecs[e.eventType].With(e.labels)
+ cv := counterVecs[s.Name].With(s.Labels)
if cv == nil {
- return fmt.Errorf("%s Labels for %s Event Type not found", e.labels, e.eventType)
+ return fmt.Errorf("%s Labels for %s Event Type not found", s.Labels, s.Name)
}
cv.Inc()
return nil
LMKWYT.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like much of this was modeled after the counter metrics I added in K10, so while I can't speak much to this code, I can offer some insights into the K10 design.
There were two primary things considered in K10's design:
- How can we leverage the type system to prevent the Prometheus library panicking unexpectedly?
- How can we model the metrics to conceptually match what's going on in the system?
The Prometheus library has an annoying propensity to panic if a CounterVec
is requested with a different set of labels than it was instantiated with. I ran into this several times while prototyping the solution, so I decided to employ the type system to help protect against this.
This meant that the developer shouldn't be able to freely associate a metric name with labels (because that was the case where we kept running into trouble and causing panics).
So the metric type had a private metric name and set of labels that were instantiated via a constructor. This meant that we only needed to ensure that the constructors worked properly, which we did through a reflection based unit test.
As for conceptual models for the metrics, I elected to use "event" over "sample" because inherently the transitionary event between a previous state and a new state (in other words, it's representing the point at which an increase happens).
This is distinct from what I think of as a "sample": the measure of the overall state at an instantaneous point in time (not the transition between two states, just the measure of the state).
To put a different way, I thought of a "sample" more as reporting the current value (i.e. a gauge), whereas "events" were the accumulation of individual changes (i.e. a counter).
So if we're going to eschew the type protections, we should recognize what the consequences might be and we should decide if the terminology aligns with conceptual models (personally, "sample" doesn't seem to make as much sense to me in this context).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before we toss away the type system prevention of Prometheus panics, please note that the Event structure facilitates writing a unit test that validates that the labels from each constructor match the labels that are used to initialize the CounterVec. Or so I've been told :).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I am advocating that we tossed away the type system. In my diff, there is still a constructor to constrain the labels that can be passed in. E.g.,
func NewSampleActionSetTotal(action string, namespace string) *Sample {
// ....
}
It isn't obvious to me how my diff is hindering the type system protection you are referring to. My intention is still this:
What do you think of merging Event, MetricType and MetricTypeOpt into a Metric and Sample structs?
Re: conceptual model, I picked "sample" because it represents a data point instance in the timeseries. Yeah, I agree it doesn't align as well as I'd want, because in a TSDB, a sample of a metric time series is a k/v pair of timestamp to value (with all the label values defined). In this case, we only have the label values, not its value at that given time. Hence, I said, "closer to what you are trying to describe" 🙂. What I have in mind is like MetricData
, MetricDataPoint
(not saying those should be the name).
I am not convinced "event" is clearer; feels like this is where it gets subjective. I think we agree that type system protection is good, and we want some way to conceptually represent the metric and distinguish it from the metric instance at a given time, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps the concern with your diff is that the Name
and Labels
are public fields. Wouldn't we be worried about developers constructing Sample
s inline and not using the constructors, thus bypassing the type system?
--
I agree that I'm not convinced "event" is more clear and that it's a very subjective topic 😄 However, one additional consideration might be: is using "sample" here going to cut off the term we would want to use for gauges in the future (when we add gauge metrics)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unexporting the public fields sound actionable to me. Personally, I am not that worried about the panics caused by initialization with unknown labels 😄, because:
- I am not convinced that they won't be caught during developer testing and code review (assuming they happen 😄)
- Most common vectors methods like
With()
andWithLabelValues()
that panic, have a non-panicGet*
counterpart
I expect developers to learn and understand the packages they are using (through docs, code review, mentorship etc.), instead of relying on others to fail-proof things for them.
I am not opposed to unexporting the public fields though.
As for the usage of "sample", yeah, it's not a good choice. I prefer not to overload the TSDB definition of "sample", which is a k/v pair of timestamp to values, regardless of the metric type. I don't have a better name for it, so Event
is fine (it's more generic than "sample") until we can come up with something else.
pkg/metrics/events.go
Outdated
return labels | ||
} | ||
|
||
func NewSampleCount(sample string) Event { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed. Done.
// Initialize all the Counter Vecs and save it in a map | ||
func InitAllCounterVecs(r prometheus.Registerer) map[MetricType]*prometheus.CounterVec { | ||
for metricType := range MetricTypeOpts { | ||
cv, err := initCounterVec(r, metricType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are all metricType
counters? If so, can we reflect that in the MetricTypeOpts
name? How about just call it Counters
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
pkg/metrics/metrics.go
Outdated
metricTypeOpts, ok := MetricTypeOpts[t] | ||
|
||
if !ok { | ||
panic(fmt.Sprintf("Event type %s is not defined", t)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer we don't panic the controller here, just because some counters don't exist. Why not just return it as an error, and let the caller decide what to do with it (e.g., log it as an error)? See the Go code review wiki.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
pkg/metrics/metrics.go
Outdated
if errors.As(err, &alreadyRegisteredErr) { | ||
counterVec = alreadyRegisteredErr.ExistingCollector.(*prometheus.CounterVec) | ||
} else if err != nil { | ||
panic(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should return the error and let the higher level caller handle this. See the Go code review wiki.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -69,7 +72,8 @@ type Controller struct { | |||
// New create controller for watching kanister custom resources created | |||
func New(c *rest.Config) *Controller { | |||
return &Controller{ | |||
config: c, | |||
config: c, | |||
counterVecs: metrics.InitAllCounterVecs(prometheus.DefaultRegisterer), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the controller need to hold on to this? It doesn't look like the controller is doing anything with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested this out without keeping the counters in the controller and metric reporting did not work.
More testing required on this. Or a designated Reporter() for all the metrics.
3fa6542
to
424ff9d
Compare
This PR is marked as stale due to inactivity. Add a new comment to reactivate it. |
This PR is closed due to inactivity. Feel free to reopen it, if it's still relevant. |
Change Overview
/metrics
.Pull request type
Please check the type of change your PR introduces:
Issues
Test Plan